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ABSTRACT 

We introduce a model for simulating mutation of prokaryote DNA sequences. Using that model we can then 
evaluated traditional techniques like parsimony and maximum likelihood methods for computing phylogenetic 
relationships. We also use the model to mimic large scale genomic changes, and use this to evaluate multi- 
fractal and related information theory techniques which take into account these large changes in determining 
phylogenetic relationships. 
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1. INTRODUCTION 

In this paper we examine the stochastic evolution of prokaryotes by simulating a number of mutational events 
from changes within genes to changes in overall genome structure. We then use the resultant in silico virtual 
mutants, 1 for which we know the ancestry, to compare the accuracy of both whole genome comparisons and 
more traditional orthologous gene comparisons in deriving phylogenetic relationships. 

The relationships between various prokaryotes, bacteria and archaea, are of great interest. 2 ' 3 Since the 
seminal work by Zuckerkandl and Pauling, 4 many of these relationships have been explored by comparing the 
DNA and amino acid sequences of the organisms in question. Given a set of sequences from several species, 
we would then like to infer (with statistical significance) a phylogenetic relationship between the species. A 
critical assumption is that the sequences diverged from a common ancestor and not by other processes such 
as gene duplication 5 or gene transfer, 6 these are known as orthologues. The algorithm used to analyze the 
sequence needs to produce a measure of divergence from the common ancestor which can then be used to 
construct a variety of trees. 7 

A number of processes are well known by which prokaryotes can mutate and thus diverge from a common 
ancestor. 8 The main mechanisms we have focused on are: 

• Base substitutions, where one base pair has been replaced with a different base through some mechanism 
(such as UV irradiation with an absent or unsuccessful repair process). 

• Additions and deletions, where a base pair has been added or removed from the sequence. 

• Rearrangements, where a sequence has been inverted or shifted to another location (or both). 

• Gene transfer, where one or more genes are inserted into a prokaryote genome from another source (such 
as a phage or by recombination with a plasmid) . 
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In addition to analyzing orthologous genes, one can compare whole genomes and their features and prop- 
erties due to the large number of complete prokaryote genomes available. 9 There have been a number of 
methods proposed for using a whole genome approach to problems of phylogcny. These include measuring the 
fraction of orthologs shared between genomes and quantifying correlations between genes with respect to their 
relative positions in genomes. 10 Similarly, comparing orderings of genes between genomes has been proposed 
as an area of investigation. 11 Other techniques which we have explored in this paper include a multifractal 
approach 12 and related information theory approaches. 13, 14 

2. RESULTS 
2.1. Comparison of standard algorithms 

We have compared a known tree that we have generated by simulating mutation of the human adenosine 
a2 receptor 15 with trees produced by standard techniques such as maximum likelihood 7, 16 and maximum 
parsimony. 

The basic algorithm of our mutation simulation software is as follows: 

1. Take the original nucleotide sequence, and apply up to five insertions, up to five deletions, and up to 
five base substitutions. We use a small random number of each to simulate random mutations and 
to put some distance between each of the generations, but not too much that it becomes too easy for 
the algorithms to distinguish between them. Repeat this step n times to produce n descendants of the 
original, these we called 0, . . . , n — 1. 

2. Taking the mutations of the previous step, x, we then mutate these to produce some m descendants of 
those descendants, which we label x © 0, . . . , x © (m — l)Va; (where © denotes concatenation). 

3. Repeat the previous step several times. 

We then use the CLUSTALW software 17, 18 to perform the alignments. The algorithm, as described in 
Durbin et al. 7 is: 

1. Construct a distance matrix of all pairs of sequences by a pairwise dynamic programming alignment, 
followed by conversion of similarity scores to approximate evolutionary distances using the Kimura 
model. 19 

2. Construct a guide tree by a neighbor-joining clustering algorithm of Saitou and Nei. 20 

3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, 
and profile- profile alignment. 

After alignment, we then generated the trees shown in Figure 1. We used the PHYLIP software 21 which 
implements the two following methods: 

• Parsimony, which generates the tree by evaluating a number of possible trees and finding the one with the 
overall minimum cost, where the cost is the number of substitutions to explain the observed sequences. 

• Maximum likelihood, which builds a tree with a maximum likelihood of occurring given a model of 
evolution and the observed sequences. The critical assumptions of the model are that base substitutions 
(and gaps) follow a Poisson process with a set of specified rates, that each site in the sequence evolves 
independently, and different lineages evolve independently. 

Both techniques worked well, however we felt that the parsimony technique worked better than the max- 
imum likelihood technique. A number of open questions have been raised about the efficacy of gap penalty 
scoring. 22 ' 23 In order to see if the incorrect portions of the tree were due to incorrect scoring of gaps, we 
repeated the steps outlined above but with no gap-producing insertions and deletions and with a corresponding 
increase in the number of substitutions (to make the distances comparative). We found no significant difference 
in the trees produced, suggesting the problems lie elsewhere. 
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(a) Tree generated using the parsimony method. 
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(b) Tree generated using the maximum likelihood method. The tree is con- 
structed to give a maximum likelihood score, and for this tree the log likelihood 
score is —4074.2. So given a probabilistic model, the probability that tree fits that 
model is e ' , which although low is higher than trees with minor variations 
that have log likelihood scores in the range —4075 to —4080. 

Figure 1. This shows two trees generated using the parsimony and maximum likelihood methods. A correct tree 
should have the numbers increasing in size going outwards from the center, with each descendant of x being x © n, for 
some n. 



2.2. Multifractal classification 



Here we use the fractal method as detailed by Yu et al., 12 ' 24 which considers the Renyi dimension D q for 
q 6 R, given by 
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The n(B) are simply the sample probabilities (in [0, 1]) of finding each of the possible substrings of codons of 
length K, e is e = 4~ K . We found a value of K = 8 works best. The space (D_i, D l7 D_ 2 ) used in conjunction 
with the neighbor joining algorithm 20 is then used to generate phylogenetic trees. 

Since the multifractal technique deals with large scale statistics, it is unreliable in dealing with the short 
sequences we analysed in the previous subsection. The question then arises, is the technique of any use 
as a whole-genome classification technique? To analyze this question, we model larger changes to bacterial 
genomes, in particular both gene transfer 6 and shuffling of operons to different positions on the genome. 11 
We have focussed on the Escherichia coli bacteria, due to the considerable amount of detailed information 
available on their genetic makeup 25 and also the mechanisms by which genetic material can be inserted into 
their genomes. 8 The E. coli bacteria we used were K12 (Genbank accession number NC_000913), 0157:H7 
(NC.002695), 0157:H7 EDL933 (NC_002655), and CFT073 (NC.004431). In the rest of this paper we simply 
use the numerical part of the accession numbers to refer to the different subspecies. The E. coli bacteria have 
a large, variable portion of foreign genes 6 so they provided an excellent test for a technique which is good at 
picking up differences in gene usage. 

The particular algorithms we used for moving entire genes around are: 

• Gene shuffling. Here we identified operons based on existing information about operons in E. coli bacteria 
which details predicted operons in E. coli. 26 Since we wanted to see how the multifractal method copes 
with shifts in general (it docs not in fact distinguish between the different types of shifting that occur), 
it is unimportant if there are some errors in finding operons. For future work on whole genome analysis, 
especially that considering the ordering of genes, 11 it may be necessary to significantly improve this rate 
if our simulator is to accurately represent actual biological processes. 

• Recombination. Here we identified regions where E. coli restriction enzymes could cleave the sequence, 
and then picked them at random to insert one or more sequences. Both the enzymes used and the 
additional sequences could be specified by the user (for example, one could insert the adenosine a2 
receptor gene used in the previous subsection and insert that using the Eco57I restriction enzyme 27 ). 

As before, we also introduced a number of base substitutions, insertions, and deletions into the sequence, this 
time we used a random number of each of up to thirty. 

We then considered both the parsimony and maximum likelihood methods for comparing a selection of 
both DNA and amino acid sequences from a set of derived mutants, and used the multifractal with neighbor 
joining method to compare the whole genome sequences. The results of using the multifractal method this are 
shown in Figure 2. 
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(a) The actual tree is, as generated by our software 
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(b) This is the tree as computed using the multifractal measure and 
the neighbor joining algorithm. Note that the tree is correct for the 
002695 and 000913 families, but has significant problems distinguishing 
between the 002655 and 004431 families. 



Figure 2. This shows two trees generated using the parsimony and maximum likelihood methods. A correct tree 
should have the numbers increasing in size going outwards from the center, with descendants of x labelled x.n, n = 0, 1. 



2.3. Evalution of multifractal methods 

As can be seen in Figure 2, our earlier work, 28 and that of Yu et al., 12 the multifractal technique has some 
promise in classifying bacteria and generating phylogenetic trees but has difficulties distinguishing between 
closely related bacteria. This is due to the fact that it ignores information differences in structure between 
different genomes and also the features of the genomes such as genes and repeat sequences. To determine an 
accuracy rate, we simulated large scale genomic changes as in the previous section for both the E. coli bacteria 
previously used as well as for a set of quite different bacteria which consisted of Bacillus cereus, Shigella 
flexneri, Salmonella typhimurium, Pseudomonas aeruginosa, Mycoplasma pulmonis, Lactobacillus plantarum, 
and Bradyrhizobium japonicum. In addition to simply benchmarking against trees produced by our software, 
we also considered well established relationships between bacteria. 29,30 For each tree we determined an 
accuracy rate by considering the fraction of the tree that was correct and the number of basic tree operations 
(rotates and shifts) needed to correct the tree. The results are given in Table 1 and gave an average of 56% 
correct. 



Table 1. The fraction of the organisms with a correct relationship to each other in the trees are shown for a number 
of simulated sets of evolution and a known set of bacteria. The number of tree operations (shifts and rotates) required 
to turn the output tree into the correct tree are shown. Simulation run 8 is for the set of actual bacteria with a known 
relationship, rather than a simulated set of bacteria. 
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Overall the multifractal technique is useful in determining rough relationships between organisms, but has 
trouble coping with finer details, most likely because it doesn't use enough of the information available about 
the genomes in question. In the following subsection we detail a related information theory approach. 

2.4. Asymptotically optimal universal compression metrics 

Here we use the same set of files we generated in subsection 2.2 and apply a lossless compression algorithm, 
with a coding rate that approaches the Shannon rate 31 ' 32 : 

H (P) = - Pi l °gPi (3) 
PieP 

The Shannon entropy is in fact a special case of the more general Renyi dimension 1 used in Subsection 2.2. 
We use the metric 

M = H(X\Y) 2 = (H(X) - H(X, Y)) 2 , (4) 

where 
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In Equations 5 and 6, x" and y^ are the sequence strings of the two genomes of lengths n and N bits (two 
bits per base), and E [L (•)] is the compressed length operator (again, in units of bits). The tree for the same 
mutations as in Subsection 2.2 is shown in Figure 3. As with the multifractal measure, the information theory 
measure has difficulties determining differences between closely related bacteria. 




Figure 3. This is the tree produced by using an information theory based metric in conjunction with the neighbor 
joining algorithm. As with the multifractal tree, it has trouble distinguishing between closely related genomes, for 
similar reasons. We propose tightening the tree building decision methods using techniques outlined in Gutman. 13 
This will enable us to place accuracy levels on parts of the tree. 



3. CONCLUSIONS 

In general we found that the multifractal and compression metrics we have explored are useful in distinguishing 
between different organisms, they have problems distinguishing between those which are closely related, or 
those which are too small for statistical properties to be estimated with any accuracy. We found the popular 
maximum likelihood and parsimony techniques to handle small changes in short sequences quite well, with 
the parsimony tree building method handling gaps better than maximum likelihood. We used an entirely 
automated approach to the problem of multiple alignment, using the CLUSTALW software package. A skilled 
operator would be able to edit the CLUSTALW results to produce better alignments which would help the 
maximum likelihood and parsimony algorithms further. 

Although the multifractal and compression metrics were found to be useful, the fact that they ignore a 
considerable amount of whole-genome information available such as gene reversals and gene ordering between 
different species means they will always have certain limitations in accuracy. Work can be done to specify 
the accuracy of the metrics produced by these methods. In particular, the compression method can identify 
organisms that it is unable to classify reliably and this could be indicated when drawing trees, the way ahead 
in whole genome comparisons and the resultant phylogenies lies with combining multiple techniques including, 
but not limited to, comparison of sets of aligned protein sequences, comparison of sets of unaligned protein 
sequences, 33 and information on gene order changes. 11,34 Some of these techniques can be time consuming, 
however, requiring significant amounts of human input. Perhaps the best approach would be to combine whole 
genome techniques with those of orthologous gene comparisons and even physiological data in determining 
highly accurate phylogenies. 
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